MOBFinder: a tool for mobilization typing of plasmid metagenomic fragments based on a language model

Abstract Background Mobilization typing (MOB) is a classification scheme for plasmid genomes based on their relaxase gene. The host ranges of plasmids of different MOB categories are diverse, and MOB is crucial for investigating plasmid mobilization, especially the transmission of resistance genes and virulence factors. However, MOB typing of plasmid metagenomic data is challenging due to the highly fragmented characteristics of metagenomic contigs. Results We developed MOBFinder, an 11-class classifier, for categorizing plasmid fragments into 10 MOB types and a nonmobilizable category. We first performed MOB typing to classify complete plasmid genomes according to relaxase information and then constructed an artificial benchmark dataset of plasmid metagenomic fragments (PMFs) from those complete plasmid genomes whose MOB types are well annotated. Next, based on natural language models, we used word vectors to characterize the PMFs. Several random forest classification models were trained and integrated to predict fragments of different lengths. Evaluating the tool using the benchmark dataset, we found that MOBFinder outperforms previous tools such as MOBscan and MOB-suite, with an overall accuracy approximately 59% higher than that of MOB-suite. Moreover, the balanced accuracy, harmonic mean, and F1-score reached up to 99% for some MOB types. When applied to a cohort of patients with type 2 diabetes (T2D), MOBFinder offered insights suggesting that the MOBF type plasmid, which is widely present in Escherichia and Klebsiella, and the MOBQ type plasmid might accelerate antibiotic resistance transmission in patients with T2D. Conclusions To the best of our knowledge, MOBFinder is the first tool for MOB typing of PMFs. The tool is freely available at https://github.com/FengTaoSMU/MOBFinder.


Introduction
Plasmids are usually small, double-stranded, and circular DNA molecules found within bacterial cells [1].Being separate from the bacterial chromosome, plasmids have the ability to replicate independently and can be transferred between bacteria through conjugation [2].Bacteria, specifically pathogenic strains, can acquire antibiotic resistance genes or virulence factors via plasmid-mediated horizontal gene transfer, aiding their ability to adapt to various environments [3].
Plasmid classification is important for investigating multiple properties of plasmids, such as host range, replication patterns, and mobilization mechanisms [4].Many classification schemes have been developed according to the distinct characteristics of plasmids, including taxonomic classification, replicon typing (Rep), incompatibility typing (Inc), mate-pair formation typing (MPF), and mobilization typing (MOB).In taxonomic classification, plasmids are categorized based on their host bacteria [5].Rep typing classifies plasmids according to genes controlling their replication, known as replication initiation genes [4,6].Inc typing takes advantage of the fact that plasmids with similar replication or partition systems are incompatible within the same cell, categorizing plasmids based on compatibility [6].MPF typing is based on genes encoding the MPF system, which consists of proteins that mediate contact and DNA exchange between donor and recipient cells during conjugation [4,7].Finally, MOB typing classifies plasmids based on the relaxase gene, which is present in all transmissible plasmids [8][9][10].And plasmids with different relaxase types are categorized as different MOB types, each of possesses a distinct transmission mechanism that determines its taxonomic host range [4,11].This variation among different MOB types is critical in researching the spread of virulence traits, the emergence of antibiotic resistance, and the adaptation and evolution of bacteria.Moreover, MOB typing has been found to be effective for identifying novel mobilizable plasmids that were previously unassigned to any Rep or Inc types, and for investigating the mobilization characteristics of plasmids with similar mobilization systems [12,13].
Recently, many experimental and computational schemes have been devised for plasmid typing, as well as to explore the diversity and functionality of plasmids (Table 1).For example, plasmid taxonomic PCR (PlasTax-PCR) [14], PCR-based replicon typing (PBRT) [15], and degenerate primer MOB typing (DPMT) [12] are multiplex PCR methods for identifying plasmids with analogous replication or mobilization systems.PlasTrans, based on deep learning, identifies mobilizable metagenomic plasmid fragments [16].Web servers such as PlasmidFinder [6], pMLST, and oriTfinder [17] were established based on collected maker gene databases and alignment-based methods to facilitate Rep, Inc, and MOB typing.COPLA [5], based on average nucleotide identity, performs taxonomic classifications of complete plasmid genomes with an overall accuracy of 41%.
For the MOB typing, MOBscan [18] uses the HMMER model to annotate relaxase genes and classify plasmids accordingly.MOB-suite [19,20] performs plasmid typing for plasmid assemblies.
First, it uses Mash distance to cluster plasmid assemblies into clusters; then, it uses marker gene databases to annotate them.Table 1.Experimental and computational schemes developed for plasmid classification.
Metagenomic sequencing makes it possible to obtain all plasmid DNA from microbial communities at once, and a number of computational tools for identifying plasmid fragments from metagenomic data have been developed, such as PlasFlow [21], PlasmidSeeker [22], PlasClass [23], PPR-Meta [24] and PlasForest [25].As DNA fragments of plasmids and bacteria are intermingled in metagenomic data [26], recognizing the transmission mechanisms and host ranges of plasmids can be challenging.To this end, it is crucial to annotate MOB types of metagenomic plasmid fragments.
However, this is difficult when plasmid assembly fragments are incomplete and essential genes for annotation are lacking.Therefore, it is worthwhile to consider alternative methods.Given that plasmids of the same MOB type have similar transmission mechanisms and host ranges, their genomic signatures (e.g., GC content and codon usage) tend to also be alike, not only relaxase [4,27].In this context, neural networks, which have demonstrated strong performance in the classification and identification of biological sequences [28,29] could be useful.Furthermore, language models [30,31] derived from such neural networks have also showcased their impressive ability to characterize sequence features [32,33].In this methodology, short sequences of nucleotides (referred to as k-mers) or amino acids are analogous to "words", and the longer sequences of DNA or proteins are analogous to "sentences".Through the application of unsupervised learning on large datasets, each "word" is linked to a feature vector that captures its context, offering a more sophisticated analysis than the traditional k-mer frequency method, which simply counts the occurrence of nucleotide sequences without acknowledging their biochemical characteristics.Unlike the conventional method, this language model-based approach assesses sequences based on their contextual importance across different genetic environments, positioning contextually similar sequences close together in a multidimensional space.This technique provides deeper insights into the biochemical complexities of nucleotide sequences, thereby furnishing a more comprehensive understanding of an organism's functional biology [34].To characterize the features of plasmids within the same MOB type, we employed language models to perform the MOB annotation.In addition to the relaxase-coding gene, language models exhibit the ability to capture more biological features and associations within comparable mobilization systems, making it possible to perform MOB annotation for metagenomic plasmid assemblies.(3) Classification model ensemble and optimization.Several classification models, specifically designed for different lengths, were trained and integrated to predict fragments of different lengths.
Evaluations against a test dataset demonstrated that MOBFinder is a powerful tool for MOB typing of plasmid fragments and bins.Its application to a cohort of patients with type II diabetes (T2D) revealed a potential correlation between some MOB types and the spread of antibiotic resistance genes among T2D patients.This suggests that MOBFinder is an effective data analysis approach for investigating plasmid-mediated horizontal gene transfer within microbial communities.

The workflow of MOBFinder
To annotate the MOB type of plasmid fragments in metagenomics, we designed MOBFinder (Figure 1).As MOB-suite [19,20] didn't offer a quantitative likelihood score for the outcomes and some plasmids would be classified into multiple MOB types (Figure S1), we constructed a benchmark dataset using a high-resolution MOB typing strategy for categorizing complete plasmid genomes (Figure 1B, 1C).Then, based on a language model and random forest, we designed an algorithm to perform MOB typing for PMFs (Figure 1D, 1E).from the NCBI were subjected to MOB typing.(D) Those complete genomes were also used to train a 4-mer language model using the skip-gram algorithm, allowing each 4-mer to be represented by a 100-dimensional word vector.For a DNA fragment, the average word vector of all 4-mers on its sequence serves as the feature vector for that DNA.(E) We constructed simulated metagenomic contigs from the complete genomes that had been MOB typed as a benchmark and encoded these contigs into word vectors.Then these word vectors were used to train a random forest algorithm.
Then the trained model, with metagenomic DNA fragments as input, was used to predict the MOB typing of the corresponding DNA fragment based on its word vectors.

MOB typing of complete plasmid genomes
Traditionally, plasmid MOB typing of complete plasmid genomes has been a bioinformatics task based on the analysis of relaxase sequence similarity.The practice of annotating MOB types through BLAST similarity searches using representative sequences of different MOB type relaxases has gradually evolved into the standard method for MOB typing [4,19,20].In this work, we constructed a benchmark dataset of simulated metagenomic contigs based on complete plasmid genomes with known MOB types.Previous studies have included a relatively small number of plasmids in their analyses.To further expand the MOB typing training dataset, we annotated the newly collected plasmid complete genome data for MOB typing according to relaxase information.
However, employing these criteria, we observed that some relaxases were annotated as belonging to multiple MOB types.To eliminate ambiguous annotations and construct a more reliable dataset for the training of MOBFinder, we imposed the stricter criteria mentioned above.After the expansion of protein sequences, local relaxase databases were built using the 'makeblastdb' command for MOB typing of plasmid genomes.
Plasmid genomes were retrieved from the NCBI nucleotide database using the keywords 'complete' and 'plasmid,' and incomplete fragments were removed manually for further analysis.The accession list of these plasmids is provided in Supplementary Table 1.For each plasmid genome, coding sequences were extracted from the genebank file, and blastp [37] was employed to search for the best alignment of local relaxase databases.Here, we defined the mob_score to measure the likelihood of homology: where qcov_max and bitscore_max represent the query coverage and bitscore corresponding to the match with the highest bit score, respectively.To identify plasmid genomes encoding known relaxase families, we set a mob_score threshold of 0.5, which was established in conjunction with a minimum query coverage of 50% and a minimum bitscore of 100.To further enhance the reliability of our classification, we introduced an e-value cutoff, conservatively set at 1e-10, to complete the plasmid genome classification (Figure 1C).In instances where plasmid genomes yielded no blast results or exhibited an e-value exceeding 0.01, we categorized them as non-MOB.

Word embeddings using a language model
To characterize the features and patterns within each MOB category and use numerical word vectors to represent them, we utilized a skip-gram language model [30,31] to learn from plasmid genomes.
Using a sliding window, the model calculated the likelihood between segmented words and outputted a probability distribution over the context words.The training steps were as follows (Figure 1D): (1) Word generation.Since DNA sequences are composed of different nucleotide characters, we used a k-mer sliding window to generate overlapping input words.For example, with k=4， 'ATCGCTGA' would be segmented into 'ATCG,' 'TCGC,' 'CGCT,' 'GCTG,' and 'CTGA'.In this step, unique words were generated.
(2) Word encoding initialization.Each word was initially assigned a random vector.
(3) Skip-gram model.We employed a standard skip-gram model as described in previous studies [30,31] to generate word vectors through the dna2vec module [31].A two-layer neural network was used to construct the skip-gram model.The initialized vectors were used as input, and the output was a probability distribution over the input words.Layer 1 was a hidden layer to convert the initialized vectors into a 100-dimensional word vector representation as predefined by Ng [31].
Layer 2 was used to compute and maximize the probability of the correct context words using the negative sampling function, with the size of context words set to 20 (10 words for upstream and downstream, respectively) as pre-set by Ng [31].
(4) Model training.For each input plasmid genome, we used an optimization algorithm to minimize the loss function.Then, using the default settings, we used backpropagation to update the neural network parameters (word vectors) for 10 epochs.
(5) Word vector extraction.After the training process, the word vectors in the hidden layer were extracted to characterize the plasmid fragments.

Benchmark dataset construction
Because there are no real metagenomic data to serve as a benchmark, using simulated data as a benchmark dataset is a common approach when developing bioinformatics tools [16,24].Therefore, in the development of MOBFinder, we artificially generated simulated datasets through the following steps: (1) For classified plasmid genomes in each MOB category, we randomly split them at a proportion of 70% and 30% to construct the training and test datasets.
(2) Training dataset.To predict plasmid fragments with different lengths, we generated contigs of different length ranges: 100-400 bp, 401-800 bp, 801-1200 bp, and 1201-1600 bp.For each MOB class in each length range, we randomly generated 90000 artificial contigs.Plasmid fragments longer than 1600 bp were segmented into shorter contigs and predicted using models designed for the corresponding lengths.
(3) Test dataset.Because some plasmid fragments in real metagenomics datasets were much longer, we generated four length groups to assess the performance of MOBFinder: Group A with a length range of 801-1200 bp, Group B with a length range of 1201-1600 bp, Group C with a length range of 3000-4000 bp, and Group D with a length range of 5000-10000 bp.For each MOB class in these four groups, 500 fragments were randomly extracted.

Classification algorithm
To efficiently handle the training dataset and improve the robustness of MOBFinder, we employed random forest to train four predictive models using the training dataset.The detailed steps are as follows (Figure 1E): (1) Word representation calculation.For each contig in the training dataset, we used a 4-mer sliding window to generate overlapping words and transformed them into numerical word vectors using trained word embeddings.To characterize the underlying features and patterns of the input contigs, we summed all the word vectors to compute their average as input of random forest.
(2) Classification model training.To improve the performance of MOBFinder, we trained four classification models on different lengths in the training dataset: 100-400 bp, 401-800 bp, 801-1200 bp, and 1201-1600 bp.The number of trees was set to 500 to generate predictive models.
(3) Model ensemble.The four trained models were ensembled into MOBFinder to make more accurate predictions.For fragments shorter than 100 bp, we used a model designed for 100-400 bp to predict the MOB type.For those longer than 1600 bp, we segmented them into short contigs and made predictions using the corresponding model.For example, a fragment with a length of 4000 bp would be segmented into three contigs: two with a length of 1600 bp and one of 800 bp.After predicting fragments with the corresponding models, we aggregated and calculated the weighted average scores for each MOB class, and the MOB type with the highest score was selected as the final prediction result for the input fragment.
(4) Plasmid bin classification.Metagenomic binning is an essential step in the reconstruction of genomes from individual microorganisms.Thus, we designed MOBFinder to perform MOB typing on both plasmid contigs and plasmid bins.If the input is a plasmid bin, MOBFinder predicts the likelihood of each MOB class for fragments within the bin.For each MOB category, MOBFinder aggregates the scores of each sequence within the bin and calculates the weighted average scores based on the sequence length.The MOB category with the maximum score is selected as the prediction result.

Performance validation
A test dataset was used to assess the performance of MOBFinder and compare it to MOB-suite and MOBscan.Because MOBscan can only predict MOB type using plasmid protein sequences rather than DNA sequences, we first annotated the proteins in the plasmid fragments of the test set using Prokka (RRID:SCR_014732) [38] and then used MOBscan to predict the MOB type based on the annotated proteins.We calculated overall accuracy, kappa, and run time by comparing the predicted classes and true classes.We used the online server of MOBscan to perform the MOB annotation, and the calculation of run time for MOBScan was confined to the duration spent on preprocessing with Prokka locally.The overall accuracy was the proportion of accurate predictions.The kappa value was calculated to quantify the performance of MOBFinder.An AUC value between 0.5 and 1 indicates that the model performs better than random chance, and a higher AUC value indicates better prediction capability.

Annotation and analysis of T2D metagenomic data
Metagenomic sequencing data (SRA045646) were retrieved from the NCBI short read archive (SRA) database to investigate whether the plasmids within different MOB classes were associated with antibiotic resistance enrichment in T2D patients, as suggested by previous studies [39,40].All metagenomic data were preprocessed using the same protocols.PRINSEQ (RRID:SCR_005454) [41] was used to remove low-quality reads and bowtie2 (RRID:SCR_016368) [42] was used to remove host reads by aligning them to the human GRCH38 reference genome downloaded from the ENSEMBL database.We excluded metagenomic samples that did not pass quality control.Because the abundance of plasmids in metagenomes was much lower than that of bacteria, we only retained samples with more than 10,000,000 paired-end reads for downstream analysis (Supplementary Table 2).
To improve the efficiency and accuracy of assembly, we used MEGAHIT (RRID:SCR_018551) [43] to generate metagenomic contigs.PPR-Meta (RRID:SCR_016915) [24] was utilized to identify and extract plasmid fragments from the assembled fragments while filtering out bacteria and phage sequences.COCACOLA [44] was employed to cluster plasmid fragments into bins based on sequence similarity and composition.This allowed us to investigate the plasmid fragments from same originate and enabled better annotation and analysis of their functions.
MOBFinder was applied to annotate the MOB types in each plasmid bin.The average fragments per kilobase per million of each plasmid bin was calculated using bowtie2 to represent its abundance.
Next, we analyzed the significance of differences in plasmid bins and various MOB types between healthy and T2D groups using the Wilcoxon rank-sum test.The calculation of p values was adjusted for multiple comparisons using the Benjamini-Hochberg method (denoted as p.adjust).ABRicate (RRID:SCR_021093) [45] was utilized to annotate antibiotic resistance genes (identity>50% and qcov>50%) in each plasmid bin, based on four antibiotic resistance gene databases [46][47][48][49].The Tukey's Honest Significant Difference test was performed to compare the identified resistance genes among different MOB classes.All statistical analyses were conducted using R.

MOB typing of plasmid genomes
To construct the benchmark datasets, we obtained 90,395 complete plasmid genomes and categorized them into 11 MOB categories using blast (

Overall performance of MOBFinder
We evaluated the overall performance of MOBFinder in terms of accuracy, kappa, and run time, and compared the tool to MOBscan and MOB-suite.MOBscan did not perform well, achieving low accuracy and kappa values across sequences of varying lengths, while MOB-suite exhibited marginally better performance than MOBscan when handling sequences of greater length (Figure 3A, 3B).In comparison, the accuracy of MOBFinder ranged from 70% to 77%, a significant improvement of at least 59% over MOB-suite (Figure 3A).The kappa of MOBFinder ranged between 67% and 75% and was approximately 65% higher than that of MOB-suite (Figure 3B).
Moreover, MOBFinder exhibited a shorter run time in the test dataset, with a more gradual increase trend (Figure 3C).In general, these results indicate that MOBFinder greatly outperformed the other tools, and consistently improved in accuracy and consistency as the sequence length increased.

Evaluation by MOB category
Next, to evaluate the discrimination ability of MOBFinder for each MOB type, we calculated the balanced accuracy, harmonic mean, and F1-score using the test dataset (Figure 3D).It demonstrated the highest performance for MOBB and MOBM, while its ability to identify non-MOB types was comparatively low.For MOBM, the balanced accuracy and harmonic mean reached up to 99% and the F1-score exceeded 96% for all length groups.For non-MOB, the balanced accuracy was 65%, the harmonic mean was 49%, and the F1-score was 40%.Compared to MOB-suite, MOBFinder exhibited much better performance in predicting all MOB classes.Even for non-MOB, it showed an approximate 13% improvement over the other tools in terms of balanced accuracy, 34% in terms of harmonic mean, and 24% in terms of F1-score.
In AUC analyses (Figure 4), all values were greater than 0.8, indicating that the tool effectively distinguished between positive and negative samples in each MOB class.In fact, most values were higher than 0.9, except for MOBT and non-MOB.The performance differences by MOB type might be attributable to the differences in host ranges and sequence features among types.Additionally, the imbalance in the training dataset for each MOB type may also be a primary factor contributing to the performance disparities.

Application to T2D metagenomic data
In a previous study, enrichment analysis of fecal samples identified antibiotic resistance pathways in patients with T2D [40].The precise mechanism of this enrichment, however, remained elusive.
We used MOBFinder to analyze real T2D metagenomic data [39].After preprocessing and assembly, 2,217,064 metagenomic fragments were generated, and plasmid assemblies were identified using PPR-Meta.Subsequently, the plasmid fragments were clustered into 55 bins and annotated using MOBFinder.By employing MOBFinder, we assigned 2 bins to the MOBF class, 8 bins to MOBL, 17 bins to MOBQ, and identified 28 bins as non-MOB (Figure 5A).Furthermore, we detected 15 bins that exhibited significant differences between the T2D group and a control group.Among them, 1 bin was classified as MOBF, 2 as MOBL, 5 as MOBQ, and 7 as non-MOB (Figure S2).Among above MOB types, MOBQ contains the highest number of bins enriched in T2D, while MOBF is widely present in Escherichia and Klebsiella (Figure 2B), and that some strains of Klebsiella are resistant to multiple antibiotics, including carbapenems [53], these two MOB types might contribute to antibiotic resistance in T2D patients.Indeed, when we compared the average abundance of each MOB type between the T2D group and the control group (Figure 5B), the abundances of MOBF and MOBQ were significantly greater in the T2D group.In addition, these two MOB types can be transferred among multiple bacterial species.This suggests that an increase in these two MOB types could potentially raise the risk of bacterial infection among individuals with T2D.Subsequently, we used four databases [46][47][48][49] to detect drug resistance genes in four MOB types (Figure 6).The number of such genes was significantly higher in MOBF than in the other three MOB types.This suggests that MOBF plasmids may carry more drug resistance genes than the other MOB types.Furthermore, the increase in MOBF and MOBQ plasmids could result in more bacteria acquiring drug resistance genes, thereby leading to more antibiotic resistance pathways in T2D patients.In summary, our results demonstrate the utility of MOBFinder for annotating plasmid fragments in metagenomes, uncovering the potential mechanisms underlying the antibiotic resistance enrichment in metagenomic analysis.

Discussion
We developed MOBFinder based on a language model and the random forest algorithm to classify plasmid fragments and bins from metagenomics data into MOB types.First, using the relaxasealignment method, plasmid genomes were classified into distinct MOB categories.Analyses revealed substantial differences in parameters such as the number, average length, and GC content of plasmid genomes across MOB types.Additionally, there were noteworthy differences in the host ranges among different MOB classes.These results suggest the potential of utilizing sequence features from different MOB types for PMF MOB typing.To characterize the plasmids within each MOB type, we used the skip-gram model to generate word vectors.Our tool demonstrated superior overall performance compared to other tools.Specifically, for each MOB category, MOBFinder exhibited significant improvements in balanced accuracy, harmonic mean, and F1-score, with values reaching up to 99% for the first two measures in the MOBM category.
Traditionally, k-mer frequency models and one-hot encoding have commonly been employed to digitize biological sequences, extensively applied across various machine learning algorithms [54].
However, both models simply mark or count the frequency of various characters in sequences, failing to reflect the biological significance underlying each character.These models may also encounter dimensionality issues [54].For instance, in the k-mer model, if k is set to 8, the dimensionality of the k-mer vector of each DNA sequence becomes 4 8 , which is problematic in metagenomics where most fragment lengths do not reach this magnitude.This would result in significant noise in the feature vector and cause overfitting.Similarly, in the one-hot model, for a sequence of length L using 4-mers as the base unit, it would require L one-hot vectors each with a dimensionality of 4 4 .In such instances, if the dataset for training is not sufficiently large, this representation method could also lead to overfitting due to high dimensionality.In contrast, word vector models offer a superior solution to these problems.Such models initially perform a random initialization of vectors for each "word."Taking the skip-gram algorithm utilized in this study as an example, the dimension of a random vector can be 1-of-n, where n represents the size of the vocabulary [30].Following unsupervised pre-training on large datasets, the algorithm maps characters with similar contexts to similar feature spaces.The dimensions of the coordinates (i.e., the word vectors) of these feature spaces will be lower than those of the initial random vectors.Thus, through unsupervised pre-training on large datasets, language models can compress highdimensional initial vectors into lower-dimensional word vectors (e.g., MOBFinder's word vectors have a dimensionality of 100), enabling the feature vectors to contain more character information while effectively avoiding dimensionality issues during supervised training.
In a metagenomic sequences classification task, 4-mer is widely used as the basic unit in various bioinformatics tools [55], thus MOBFinder takes this as a "word."To assess the impact of training word vectors with different k-mer lengths on performance, we compared models with k-mer lengths of 2, 3, 4, 5, 6, 7, and 8 (Figure S3).We observed lower overall accuracy and kappa values for k=2.
At k=4, the balanced accuracy, harmonic mean, F1-score, and AUC values stabilized across different MOB types.Subsequently, as the k-mer length increased, there was no significant improvement in accuracy or other metrics, while the run time gradually increased.Therefore, we chose a k-mer length of 4 for training word vectors and developing MOBFinder.
Interestingly, in an analysis of T2D metagenomic sequencing data [39], we noted a significant increase in MOBF and MOBQ type plasmids in T2D patients.Moreover, we found more drug resistance genes in the MOBF class, whose dominant hosts are Klebsiella and Escherichia, which are associated with the spread of multidrug resistance.Although previous analyses of gut metagenomic data from patients with T2D have reported enrichment of drug resistance pathways [40], our results suggest a potential reason for it: the increased abundance of MOBF and MOBQ type plasmids in the guts of individuals with T2D may disseminate more antibiotic resistance genes, resulting in such enrichment.
At present, databases contain a large amount of human metagenomic data derived from secondgeneration sequencing.However, understanding of the functions of numerous disease-linked microbial sequences remains limited, attributable to the incomplete nature of metagenomic fragments.The development of MOBFinder enables MOB annotation for plasmid fragments from metagenomics data and provides a powerful tool for investigating the transmission mechanisms of plasmid-mediated antibiotic resistance genes and virulence factors.

Conclusions
In summary, MOBFinder is a tool for MOB typing of plasmid fragments and bins from metagenomic data.Analyses of classified plasmid genomes unveiled notable differences in sequence characteristics and host ranges across MOB types.Hence, we employed a language model to extract the sequence features specific to each MOB type and represented them using word vectors.
Additionally, we boosted prediction accuracy by training and integrating several random forest classification models.MOBFinder surpassed other tools in performance tests and successfully detected an increase in certain MOB type plasmids in T2D patients.Importantly, these MOB type plasmids harbor potential drug-resistance genes, thus offering an explanation for the observed antibiotic resistance in T2D individuals.This suggests that MOBFinder could potentially aid the formulation of specific medications to curb drug resistance transmission.We anticipate that MOBFinder will be a powerful tool for the analysis of plasmid-mediated transmission.Click here to access/download;Figure Click here to access/download;Figure Click here to access/download;Figure Click here to access/download;Figure Click here to access/download;Figure Click here to access/download;Figure Also, all the related scripts and data have also submitted to the GigaDB server.We apologize for the confusion between the Wilcoxon rank-sum test and the Wilcoxon signed-rank test.We have corrected this mistake in the revised manuscript.
In Line 368 and 472 of the revised manuscript, the statistical method has been corrected to "the Wilcoxon rank-sum test".

Since MOBscan can only predict the MOB type with plasmid proteins, we annotated the plasmids in the test set with Prokka, then manually submitted them to the MOBscan website for MOB type annotation.
Given that MOBScan operates as an online tool and cannot be executed locally, the calculation of MOBScan's run time was confined to the duration spent on preprocessing with Prokka locally."(Please refer to Line 313-319 in the revised manuscript).
-> Actually, it can be executed locally using the scripts included in https://github.com/santirdnd/COPLA/.It may not be necessary to run MOBscan locally (it may be okay that they manually submitted them to the MOBscan website), but I'll inform you regardless.
We are very grateful to Reviewer 1 for reminding us that MOBscan can be run locally.
In the revised manuscript, we have removed the statement "MOBscan can only predict the MOB type with plasmid proteins" and revised the corresponding description to "We used the online server of MOBscan to perform the MOB annotation, and…" (Please refer to Line 312-313 in the revised manuscript).

Line 418-421
In the comparison, it was observed that MOBscan did not perform well, achieving low accuracy and kappa values across sequences of varying lengths, while MOB-suite exhibited marginally better performance than MOBscan when handling sequences of greater length (Figure 3A, 3B).(Please refer to Line 418-421 in the revised manuscript).
-> Do the authors' results contradict the following general expectation?MOB-typer utilizes BLAST, whereas MOBscan utilizes hmmscan, and therefore, MOBscan is expected to retrieve more distantly related proteins than MOB-typer.
We would like to thank Reviewer 1 for the discussion regarding BLAST and HMMscan.Firstly, we acknowledge that for more distantly related proteins, sequence searching based on HMM exhibits higher sensitivity than BLAST.However, we believe that our results do not contradict this theory.There are two reasons that might explain why, in this manuscript, the performance of tools based on HMM appears slightly inferior to those based on BLAST.
(1) The number of reference sequences can impact the performance of the tools.In MOB-suite, a large number of reference sequences are used for BLAST sequence alignment, whereas in MOBscan, the number of relaxase sequences used to profile HMM files for some MOB types is not very large.For instance, for the MOBF type, MOBscan utilizes 146 relaxase sequences for configuring HMM files, while MOBsuite employs 396 sequences to construct the BLAST database.The difference in the number of reference sequences could potentially lead to MOBscan's performance being slightly inferior to that of MOB-suite.
(2) The aim of this study is not to design new methods for identifying novel relaxases.
In our test data, the relaxases all come from sequenced plasmids, so there is some homology with the relaxases in the database.When the query sequence and the database sequence have high homology, the performance of BLAST may not necessarily be worse than HMM.In fact, existing studies have shown that methods based on BLAST can sometimes outperform those based on HMM when the homology is high (Ref: PMID: 25140992).
4. MOB-suit and MOBscan are represented by blue lines, orange lines and gray lines respectively.
-> should be "MOB-suite" We thank Reviewer 1 for the careful checking of the manuscript.In Line 427-428 of the revised manuscript, we have revised "MOB-suit" to "MOB-suite".
5. I suggest receiving English language editing before publishing the paper.
"For the MOB typing, MOBscan [18] uses the HMMER model to annotated the relaxases and further perform MOB typing."-> should be "For the MOB typing, MOBscan [18] uses the HMMER model to annotate the relaxases and further perform MOB typing." We are sorry for the grammatical error.In Line 109 of the revised manuscript, we have revised the sentence as "For the MOB typing, MOBscan [18] uses the HMMER model to annotate relaxase genes and classify plasmids accordingly".
In addition, we have used the copy editing service of the GigaScience journal to refine the language through the whole manuscript.Here, we would like to express our sincere gratitude to Reviewer 2 for the positive comments on our work, describing our revised manuscript: "The manuscript has been improved substantially and all the initial concerns have been addressed satisfactorily."

To Reviewer #2:
We are very thankful for Reviewer 2's suggestions during the first revision process, which greatly enhanced the clarity, depth, and academic value of our paper.
In hoping that the above revision has clarified all the points by two reviewers and given a point-by-point response to all the concerns, we hereby submit our revised manuscript to the journal.We thank you for your kind consideration.
Note: The initial version of the article has been made public on bioRxiv (https://doi.org/10.1101/2023.12.06.570414), and beyond that, it has not been published in any other traditional journals.
Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, China Email: fangzc@smu.edu.cn Thus, we presented MOBFinder, a tool for annotating MOB types from plasmid metagenomic fragments (PMFs).MOBFinder can process single or multiple plasmid DNA sequences, and provides predicted MOB types for each input fragment, including MOBB, MOBC, MOBF, MOBH, MOBL, MOBM, MOBP, MOBQ, MOBT, MOBV and non-MOB.Moreover, it provides the option to annotate plasmid bins from metagenomics data.An overview of this work is shown in Figure 1A, and the development of MOBFinder involved the following steps: (1) Benchmark dataset construction.Plasmid complete genomes obtained from the National Center for Biotechnology Information (NCBI) were classified into different MOB types based on relaxase databases.Then, to simulate plasmid fragments in metagenomic data, an artificial benchmark dataset of varying lengths is generated.(2) Word embeddings.Numerical word vectors were generated using skip-gram to characterize the sequence features of different MOB categories.

Figure 1 .
Figure 1.Flowchart of the technical approach utilized in this study.(A) General workflow of the development and testing of MOBFinder.(B) Using plasmid relaxases with known MOB types as reference sequences, we developed a database of relaxases from the non-redundant (NR) database representing different MOB types.(C) Utilizing the relaxase database, complete plasmid genomes (a) was calculated to assess the overall consistency between the predictions and true classes, which took into account the possibility of random prediction.Po represented observed accuracy [Po = (A11 + A22 + ... + Ann) / N], where A11, A22, and Ann represented the values on the diagonal of the confusion matrix and n represented the number of MOB categories.N represented the total number of samples.Pe represented the expected accuracy [Pe = (E11 + E22 + ... + Enn) / N^2], where E11, E22, and Enn were the expected values in each cell of the confusion matrix, n was the number of MOB classes, and N was the total number of samples.The run time was recorded using the command 'time' in Linux. = ( − )/(1 − ) ()   = ( + )/2 () ℎ  = 2 *  * /( + ) () 1 −  = 2 *  * /( + ) () For each MOB category, we also calculated the balanced accuracy (b), harmonic mean (c) and F1score (d).Considering the class imbalance within the training dataset, balanced accuracy was used to measure the average accuracy of each MOB category, where TPR was the true positive rate [TRP = true positives / (true positives + false negatives)] and TNR was the true negative rate [TNR = true negatives / (true negatives + false positives)].The harmonic mean provided an overall evaluation of the model's performance, where Sn and Sp represented sensitivity [Sn = true positives/ (true positives + false negatives)] and specificity [Sp = true negatives / (true negatives + false positives)], respectively.The F1-score combined precision and recall, providing a balanced measure of the model's performance, where precision was the number of correct positive predictions out of all positive predictions [precision = true positives / (true positives + false positives)] and recall was the number of correct positive predictions out of all actual positive predictions.[recall = true positives / (true positives + false negatives]].A receiver operating characteristic (ROC) curve was used to visualize the performance of MOBFinder in predicting each MOB category, where the x-axis and y-axis were the false positive rate (FPR) and true positive rate (TPR).Plots closer to the left and top indicate higher TPR and lower FPR, which means better performance.For each MOB class, the area under the curve (AUC)

Figure 2 .
Figure 2. Benchmark dataset construction using a high-resolution strategy.(A) Proportion of

Figure 3 .
Figure 3. Overall performance of MOBFinder and comparison to MOB-suite and MOBScan.Evaluation and comparison in terms of (A) accuracy, (B) kappa, and (C) run time (C).The four fragment length groups in the test dataset were Group A (801-1200 bp), Group B (1201-1600 bp), Group C (3000-4000 bp), and Group D (500-10000 bp).(D) For each MOB type, the balanced accuracy, harmonic mean, and F1-score were used to assess the performance of MOBFinder and compared to MOB-suite and MOBscan.Since MOB-suite and MOBscan do not include the prediction of MOBL, only the results of MOBL from MOBFinder are provided.MOBFinder, MOBsuite and MOBscan are represented by blue lines, orange lines and gray lines respectively.

Figure 4 .
Figure 4. ROC curves and AUC values for MOBFinder.The curves were plotted using the output scores of MOBFinder, and the AUC values were calculated to quantify the performance of the tool for each MOB class.

Figure 5 .
Figure 5. Annotation of T2D-related plasmid bins using MOBFinder.(A) Heatmap of plasmid bins between T2D patients and controls.Each column represents a sample, and each row represents a plasmid bin.(B) Comparison of the abundance of the four identified MOB types between T2D patients and controls.The p-value was calculated using the Wilcoxon rank-sum test, adjusted using the Benjamini-Hochberg method for multiple comparisons.(*p.adjust < 0.05, **p.adjust < 0.01, and ***p.adjust< 0.001.)

Figure 6 . 3 . 5 .
Figure 6.Comparison of resistance genes among different MOB types.Four databases were used to identify antibiotic resistance gene within each MOB type, and the p-value was calculated using Tukey's Honest Significant Difference test.The two groups without significance markings indicate no statistical difference.(*p-value < 0.05, **p-value < 0.01 and ***p-value < 0.001.)

Figure 1
Figure 1 Figure6.tifAccording to the journal requirement, we have registered the tool in the bio.tools and SciCrunch.orgdatabases.In Lines 585-586 of the "Availability of Source Code and Requirements" section of the revised manuscript, we have added the following Comments: I would like to commend you on the revisions made to your manuscript following the initial round of reviews.It is evident that considerable effort has been put into addressing the concerns and suggestions raised during the first review.The changes and additions you have implemented have significantly enhanced the clarity, depth, and scholarly value of your paper.The manuscript has been improved substantially and all the initial concerns have been addressed satisfactorily.I support the publication of this manuscript in GigaScience.

Table 2 )
. We removed 22,470 of them potentially classified into more than one MOB class, leaving 67,925 classified genomes for the training and optimization of MOBFinder (Figure2A).Our analysis results revealed significant differences in the number, average length, and GC content of plasmid genomes among MOB types.
[52]bly, non-MOB types included the genomes with the most and longest average length, whereas MOBB and MOBM had the fewest plasmid genomes and shortest average length, respectively.In terms of GC content, MOBL had the lowest and MOBQ had the highest amounts.Moreover, plasmids of different MOB types exhibited diverse host ranges at the genus level (Figure2B).MOBB was predominantly found in Bacteroides, Hymenobacter, Parabacteroides, Phocaeicola and Spirosoma.Particularly, Phocaeicola has been detected in the human gut and possessed the gene for porphyran degradation through horizontal gene transfer[50].MOBC, MOBF, MOBH, and MOBP were all found in Escherichia and Klebsiella.And Klebsiella is a multidrug-resistant bacterium that has demonstrated resistance to multiple antibiotics[51].MOBL, MOBT, and MOBV were mainly discovered in Bacillus and Enterococcus.Almost all MOBM type plasmid genomes were present in Clostridium and Enterocloster, and some species in Clostridium could cause various diseases[52].MOBQ demonstrated a broader host range, including Acinetobacter, Agrobacterium,

Table 2 .
Number, average length, and GC content of plasmid genomes for each MOB type.

Table 1 .
Experimental and computational schemes developed for plasmid classification.

Table 2 .
Number, average length, and GC content of plasmid genomes for each MOB type.